Huge Parsed Corpora in LASSY
Abstract
One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language processing and information extraction is illustrated.
Similar resources
Large Scale Syntactic Annotation of Written Dutch: Lassy
The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the STEVIN programme. The focus is on written language in order to complement the Spoken Dutch Corpus (CGN) [13], completed in 2003. In D-COI (a pilot project funded by STEVIN), a 50-million-word pilot corpus has been compiled, parts of which were enriched with verified synta...
Parsed Corpora for Linguistics
Knowledge-based parsers are now accurate, fast and robust enough to be used to obtain syntactic annotations for very large corpora fully automatically. We argue that such parsed corpora are an interesting new resource for linguists. The argument is illustrated by means of a number of recent results which were established with the help of parsed corpora.
Using Parsed Corpora for Estimating Stochastic Inversion Transduction Grammars
An important problem when using Stochastic Inversion Transduction Grammars is their computational cost. More specifically, when dealing with corpora such as Europarl, even a single iteration of the estimation algorithm becomes prohibitive. In this work, we reduce this cost by taking advantage of the bracketing information in parsed corpora and show machine translation results obtained with ...
Round trips with meaning stopovers
This paper describes taking parsed sentences, going to meaning representations (the stopover), and then back to parsed sentences (the round trip). Keeping to the same language tests the combined success of building meaning representations from parsed input and of generating parsed output. Switching languages when manipulating meaning representations would achieve translation. Transfer shortfall...
Evaluation of parsed corpora: Experiments in user-transparent and user-visible evaluation
In the present paper, we describe and discuss the evaluation of parsed corpora, namely the ones that are available on the Web for querying in the AC/DC project. The paper has two parts: the first one suggests a set of different evaluation parameters and measures that are much more illuminating than commonly used simple precision measures, while the second evaluates the parsed corpus fo...